{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "pre-prosses.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "#52-文本预处理\n",
        "##本节目录\n",
        "\n",
        "*   [处理步骤](https://colab.research.google.com/drive/1oQH-UYj-DhB-DiUJ3nO2dEprKcNSD4F0#scrollTo=ltSVAkJi-SVF&line=6&uniqifier=1)\n",
        "*   [读取数据集](https://colab.research.google.com/drive/1oQH-UYj-DhB-DiUJ3nO2dEprKcNSD4F0#scrollTo=BE1LFmRneFrp)\n",
        "*   [词元化](https://colab.research.google.com/drive/1oQH-UYj-DhB-DiUJ3nO2dEprKcNSD4F0#scrollTo=FyvsoyYqeMuW&line=6&uniqifier=1)\n",
        "*   [词表](https://colab.research.google.com/drive/1oQH-UYj-DhB-DiUJ3nO2dEprKcNSD4F0#scrollTo=aTVML0Pqeqb1&line=5&uniqifier=1)\n",
        "*   [整合所有功能](https://colab.research.google.com/drive/1oQH-UYj-DhB-DiUJ3nO2dEprKcNSD4F0#scrollTo=cKrAEBT7gplL&line=6&uniqifier=1)\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "wTVkRfB2hnx8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "##启用GPU"
      ],
      "metadata": {
        "id": "4F0oSvy211p_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%tensorflow_version 2.x\n",
        "import tensorflow as tf\n",
        "device_name = tf.test.gpu_device_name()\n",
        "if device_name != '/device:GPU:0':\n",
        "  raise SystemError('GPU device not found')\n",
        "print('Found GPU at: {}'.format(device_name))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nkcC6rZz1xGH",
        "outputId": "197e037a-b939-4fef-a19e-ca24b9f1b851"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Found GPU at: /device:GPU:0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "##安装pytorch和d2l"
      ],
      "metadata": {
        "id": "byK-3lVy2BIU"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip3 install torch torchvision torchaudio"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GiXMVWXv2IK4",
        "outputId": "7b6d0598-2fcc-4b04-a347-87ecf6839065"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (1.10.0+cu111)\n",
            "Requirement already satisfied: torchvision in /usr/local/lib/python3.7/dist-packages (0.11.1+cu111)\n",
            "Requirement already satisfied: torchaudio in /usr/local/lib/python3.7/dist-packages (0.10.0+cu111)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch) (3.10.0.2)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from torchvision) (1.19.5)\n",
            "Requirement already satisfied: pillow!=8.3.0,>=5.3.0 in /usr/local/lib/python3.7/dist-packages (from torchvision) (7.1.2)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip3 install d2l==0.14"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VZQ-iQON2KdN",
        "outputId": "62c942aa-bf75-4277-d03c-1cbb83cbf5d1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting d2l==0.14\n",
            "  Downloading d2l-0.14.0-py3-none-any.whl (48 kB)\n",
            "\u001b[?25l\r\u001b[K     |██████▊                         | 10 kB 29.0 MB/s eta 0:00:01\r\u001b[K     |█████████████▍                  | 20 kB 35.7 MB/s eta 0:00:01\r\u001b[K     |████████████████████            | 30 kB 19.3 MB/s eta 0:00:01\r\u001b[K     |██████████████████████████▉     | 40 kB 16.7 MB/s eta 0:00:01\r\u001b[K     |████████████████████████████████| 48 kB 4.7 MB/s \n",
            "\u001b[?25hRequirement already satisfied: jupyter in /usr/local/lib/python3.7/dist-packages (from d2l==0.14) (1.0.0)\n",
            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from d2l==0.14) (3.2.2)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from d2l==0.14) (1.3.5)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from d2l==0.14) (1.19.5)\n",
            "Requirement already satisfied: nbconvert in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (5.6.1)\n",
            "Requirement already satisfied: notebook in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (5.3.1)\n",
            "Requirement already satisfied: jupyter-console in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (5.2.0)\n",
            "Requirement already satisfied: qtconsole in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (5.2.2)\n",
            "Requirement already satisfied: ipywidgets in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (7.6.5)\n",
            "Requirement already satisfied: ipykernel in /usr/local/lib/python3.7/dist-packages (from jupyter->d2l==0.14) (4.10.1)\n",
            "Requirement already satisfied: ipython>=4.0.0 in /usr/local/lib/python3.7/dist-packages (from ipykernel->jupyter->d2l==0.14) (5.5.0)\n",
            "Requirement already satisfied: tornado>=4.0 in /usr/local/lib/python3.7/dist-packages (from ipykernel->jupyter->d2l==0.14) (5.1.1)\n",
            "Requirement already satisfied: jupyter-client in /usr/local/lib/python3.7/dist-packages (from ipykernel->jupyter->d2l==0.14) (5.3.5)\n",
            "Requirement already satisfied: traitlets>=4.1.0 in /usr/local/lib/python3.7/dist-packages (from ipykernel->jupyter->d2l==0.14) (5.1.1)\n",
            "Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (4.8.0)\n",
            "Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (2.6.1)\n",
            "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (57.4.0)\n",
            "Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (0.8.1)\n",
            "Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (0.7.5)\n",
            "Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (4.4.2)\n",
            "Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (1.0.18)\n",
            "Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (0.2.5)\n",
            "Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0->ipykernel->jupyter->d2l==0.14) (1.15.0)\n",
            "Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->jupyter->d2l==0.14) (3.5.2)\n",
            "Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->jupyter->d2l==0.14) (1.0.2)\n",
            "Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->jupyter->d2l==0.14) (5.1.3)\n",
            "Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets->jupyter->d2l==0.14) (0.2.0)\n",
            "Requirement already satisfied: jupyter-core in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (4.9.1)\n",
            "Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (4.3.3)\n",
            "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (21.4.0)\n",
            "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (4.10.1)\n",
            "Requirement already satisfied: importlib-resources>=1.4.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (5.4.0)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (3.10.0.2)\n",
            "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.7/dist-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (0.18.1)\n",
            "Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.7/dist-packages (from importlib-resources>=1.4.0->jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->jupyter->d2l==0.14) (3.7.0)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from notebook->jupyter->d2l==0.14) (2.11.3)\n",
            "Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from notebook->jupyter->d2l==0.14) (0.13.1)\n",
            "Requirement already satisfied: Send2Trash in /usr/local/lib/python3.7/dist-packages (from notebook->jupyter->d2l==0.14) (1.8.0)\n",
            "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from jupyter-client->ipykernel->jupyter->d2l==0.14) (2.8.2)\n",
            "Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.7/dist-packages (from jupyter-client->ipykernel->jupyter->d2l==0.14) (22.3.0)\n",
            "Requirement already satisfied: ptyprocess in /usr/local/lib/python3.7/dist-packages (from terminado>=0.8.1->notebook->jupyter->d2l==0.14) (0.7.0)\n",
            "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->notebook->jupyter->d2l==0.14) (2.0.1)\n",
            "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->d2l==0.14) (1.3.2)\n",
            "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->d2l==0.14) (3.0.7)\n",
            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->d2l==0.14) (0.11.0)\n",
            "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (0.8.4)\n",
            "Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (0.4)\n",
            "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (1.5.0)\n",
            "Requirement already satisfied: bleach in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (4.1.0)\n",
            "Requirement already satisfied: testpath in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (0.5.0)\n",
            "Requirement already satisfied: defusedxml in /usr/local/lib/python3.7/dist-packages (from nbconvert->jupyter->d2l==0.14) (0.7.1)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->jupyter->d2l==0.14) (21.3)\n",
            "Requirement already satisfied: webencodings in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->jupyter->d2l==0.14) (0.5.1)\n",
            "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->d2l==0.14) (2018.9)\n",
            "Requirement already satisfied: qtpy in /usr/local/lib/python3.7/dist-packages (from qtconsole->jupyter->d2l==0.14) (2.0.1)\n",
            "Installing collected packages: d2l\n",
            "Successfully installed d2l-0.14.0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#文本预处理\n",
        "##处理步骤\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "\n",
        "1.   将文本作为字符串加载至内存\n",
        "2.   将字符串拆分为词元（eg.单词和字符）\n",
        "3.   建立词表，将拆分的词元映射到数字索引\n",
        "4.   将文本转换为数字索引序列，方便模型操作\n",
        "\n"
      ],
      "metadata": {
        "id": "ltSVAkJi-SVF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import collections\n",
        "import re\n",
        "from d2l import torch as d2l"
      ],
      "metadata": {
        "id": "DQBfRqj5-uVZ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "##读取数据集"
      ],
      "metadata": {
        "id": "BE1LFmRneFrp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt', '0cb91d09b814ecdc07b50f31f8dcad3e81d6a86d')\n",
        "\n",
        "#清洗数据\n",
        "#消除无用文本，转换大小写\n",
        "def read_time_machine():\n",
        "    \"\"\"将时间机器数据集加载到文本行的列表中\"\"\"\n",
        "    with open(d2l.download('time_machine'),'r') as f:\n",
        "        lines = f.readlines()\n",
        "    return [re.sub('[^A-Za-z]+',' ',line).strip().lower() for line in lines]\n",
        "\n",
        "lines = read_time_machine()\n",
        "print(f'# 文本总行数：{len(lines)}')\n",
        "print(lines[0])\n",
        "print(lines[10])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U80crdJH-6EL",
        "outputId": "d940c724-b41c-434b-bc44-cd8b0c1ebb80"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...\n",
            "# 文本总行数：3221\n",
            "the time machine by h g wells\n",
            "twinkled and his usually pale face was flushed and animated the\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "##词元化\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "\n",
        "> 文本行列表中的每个文本序列被转化成以词元为单位的词元列表\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "FyvsoyYqeMuW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#拆分数据为单词或字母\n",
        "def tokenize(lines,token='word'):\n",
        "    \"\"\"将文本行拆分为单词或字符词元\"\"\"\n",
        "    if token == 'word':\n",
        "        return [line.split() for line in lines]\n",
        "    elif token == 'char':\n",
        "        return [list(line) for line in lines]\n",
        "    else:\n",
        "        print('error: Unkown type:' + token)\n",
        "\n",
        "tokens = tokenize(lines)\n",
        "for i in range(11):\n",
        "    print(tokens[i])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "KCWY6qVr0i_b",
        "outputId": "58ffaee1-97b3-48d8-a973-1db9795c007e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
            "[]\n",
            "[]\n",
            "[]\n",
            "[]\n",
            "['i']\n",
            "[]\n",
            "[]\n",
            "['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']\n",
            "['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']\n",
            "['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "##词表\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "> 词表(vocabulary):是一个能让词元和数字索引互相映射的字典\n",
        "\n",
        "\n",
        "\n",
        "> 语料(corpus):是文本中各个词元的出现频率的统计结果\n",
        "\n",
        "\n",
        "\n",
        "> 根据词元出现频率分配数字索引，同时移除较少的词元降低词表的稀疏程度\n",
        "\n",
        "\n",
        "\n",
        "> 特殊词元:\n",
        "1.   未知词元<unk>\n",
        "2.   填充词元<pad>\n",
        "3.   序列开始词元<bos>\n",
        "4.   序列结束词元<eos>\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "aTVML0Pqeqb1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class Vocab:\n",
        "    \"\"\"文本词表\"\"\"\n",
        "    def __init__(self,tokens=None,min_freq=0,reserved_tokens=None):\n",
        "        if tokens is None:\n",
        "            tokens = []\n",
        "        if reserved_tokens is None:\n",
        "            reserved_tokens = []\n",
        "        #按出现频率排序\n",
        "        counter = count_corpus(tokens)\n",
        "        self._token_freqs = sorted(counter.items(),key=lambda x: x[1],reverse=True)\n",
        "        #未知词元的索引为0\n",
        "        self.idx_to_token = ['<unk>'] + reserved_tokens\n",
        "        self.token_to_idx = {token: idx for idx,token in enumerate(self.idx_to_token)}\n",
        "        self.idx_to_token,self.token_to_idx = [],dict()\n",
        "        print(self.token_to_idx)\n",
        "        for token,freq in self._token_freqs:\n",
        "            if freq < min_freq:\n",
        "                break\n",
        "            if token not in self.token_to_idx:\n",
        "                self.idx_to_token.append(token)\n",
        "                self.token_to_idx[token] = len(self.idx_to_token) - 1\n",
        "\n",
        "    def __len__(self):\n",
        "        return len(self.idx_to_token)\n",
        "\n",
        "    def __getitem__(self,tokens):\n",
        "        if not isinstance(tokens,(list,tuple)):\n",
        "            return self.token_to_idx.get(tokens,self.unk)\n",
        "        return [self.__getitem__(token) for token in tokens]\n",
        "\n",
        "    def to_tokens(self,indices):\n",
        "        if not isinstance(indices,(list,tuple)):\n",
        "            return self.idx_to_token[indices]\n",
        "        return [self.idx_to_token[index] for index in indices]\n",
        "\n",
        "    @property\n",
        "    def unk(self): #未知词元的索引为0\n",
        "        return 0\n",
        "\n",
        "    @property\n",
        "    def token_freqs(self):\n",
        "        return self._token_freqs\n",
        "\n",
        "def count_corpus(tokens):\n",
        "    \"\"\"统计词元的频率\"\"\"\n",
        "    #这里的tokens是1D列表或2D列表\n",
        "    if len(tokens) == 0 or isinstance(tokens[0],list):\n",
        "        #将词元列表展平成一个列表\n",
        "        tokens = [token for line in tokens for token in line]\n",
        "    return collections.Counter(tokens)"
      ],
      "metadata": {
        "id": "carzgtSy7H4n"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "vocab = Vocab(tokens)\n",
        "print(list(vocab.token_to_idx.items())[:10])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-NskhH8ICP0b",
        "outputId": "80e6d765-eb8a-47f8-ca73-b6a38cabee3f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{}\n",
            "[('the', 0), ('i', 1), ('and', 2), ('of', 3), ('a', 4), ('to', 5), ('was', 6), ('in', 7), ('that', 8), ('my', 9)]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for i in [0,10]:\n",
        "    print('token:',tokens[i])\n",
        "    print('id:',vocab[tokens[i]])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zT9px1-TCZ4H",
        "outputId": "16e6c9a6-58f5-4826-b6b9-4a2ce3857d66"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "token: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
            "id: [0, 18, 49, 39, 2182, 2183, 399]\n",
            "token: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n",
            "id: [2185, 2, 24, 1043, 361, 112, 6, 1420, 2, 1044, 0]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "##整合所有功能\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "\n",
        "1.   读取清洗后的文件\n",
        "2.   词元化\n",
        "3.   建立词表\n",
        "\n"
      ],
      "metadata": {
        "id": "cKrAEBT7gplL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def load_corpus_tine_machine(max_tokens=-1):\n",
        "    \"\"\"返回数据集的词元索引列表和词表\"\"\"\n",
        "    lines = read_time_machine()\n",
        "    tokens = tokenize(lines,'char')\n",
        "    vocab = Vocab(tokens)\n",
        "    \"\"\"将所有文本行展平到一个列表\"\"\"\n",
        "    corpus = [vocab[token] for line in tokens for token in line]\n",
        "    if max_tokens > 0 :\n",
        "        corpus = corpus[:max_tokens]\n",
        "    return corpus,vocab\n",
        "\n",
        "corpus,vocab = d2l.load_corpus_time_machine()\n",
        "len(corpus),len(vocab)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "v1mgkKD6CvE_",
        "outputId": "2fba3c07-3bb9-46c2-e1fc-e0c5fccba128"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(171489, 28)"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "t0m8D6NQFF8T"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}